Constrained Markov Decision Process


Initial Distribution Sensitivity of Constrained Markov Decision Processes

Tercan, Alperen, Ozay, Necmiye

arXiv.org Artificial Intelligence

Constrained Markov Decision Processes (CMDPs) are notably more complex to solve than standard MDPs due to the absence of universally optimal policies across all initial state distributions. This necessitates re-solving the CMDP whenever the initial distribution changes. In this work, we analyze how the optimal value of CMDPs varies with different initial distributions, deriving bounds on these variations using duality analysis of CMDPs and perturbation analysis in linear programming. Moreover, we show how such bounds can be used to analyze the regret of a given policy due to unknown variations of the initial distribution.
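The dependence of the optimal value on the initial distribution is easy to observe in the standard occupancy-measure linear program for tabular CMDPs. The sketch below (the MDP, costs, and budget are illustrative assumptions, not from the paper, and it assumes NumPy and SciPy are available) solves the same constrained LP under two different initial distributions and obtains two different optimal values:

```python
import numpy as np
from scipy.optimize import linprog

# Toy CMDP: 2 states, 2 actions (all numbers are illustrative assumptions).
nS, nA, gamma = 2, 2, 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[s][a] = next-state distribution
              [[0.7, 0.3], [0.05, 0.95]]])
r = np.array([[1.0, 0.0], [0.5, 2.0]])     # reward r(s, a)
c = np.array([[0.0, 1.0], [0.2, 1.5]])     # cost   c(s, a)
budget = 1.0                               # expected discounted cost <= budget

def cmdp_value(mu):
    """Optimal constrained value for initial distribution mu, via the
    occupancy-measure LP: max_d <d, r> s.t. flow conservation, <d, c> <= budget."""
    # Flow constraints: sum_a d(s',a) - gamma * sum_{s,a} P(s'|s,a) d(s,a) = mu(s')
    A_eq = np.zeros((nS, nS * nA))
    for sp in range(nS):
        for s in range(nS):
            for a in range(nA):
                A_eq[sp, s * nA + a] = (sp == s) - gamma * P[s, a, sp]
    res = linprog(-r.ravel(),                     # linprog minimizes, so negate
                  A_ub=[c.ravel()], b_ub=[budget],
                  A_eq=A_eq, b_eq=mu, bounds=(0, None))
    return -res.fun

v1 = cmdp_value(np.array([1.0, 0.0]))
v2 = cmdp_value(np.array([0.0, 1.0]))
print(v1, v2)   # the optimal value changes with the initial distribution
```

Re-solving the LP for every new `mu` is exactly the burden the paper's sensitivity bounds are meant to alleviate.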


Learning Deterministic Policies with Policy Gradients in Constrained Markov Decision Processes

Montenegro, Alessandro, Cesani, Leonardo, Mussi, Marco, Papini, Matteo, Metelli, Alberto Maria

arXiv.org Artificial Intelligence

Constrained Reinforcement Learning (CRL) addresses sequential decision-making problems where agents are required to achieve goals by maximizing the expected return while meeting domain-specific constraints. In this setting, policy-based methods are widely used thanks to their advantages when dealing with continuous-control problems. These methods search in the policy space with an action-based or a parameter-based exploration strategy, depending on whether they learn the parameters of a stochastic policy or those of a stochastic hyperpolicy. We introduce an exploration-agnostic algorithm, called C-PG, which enjoys global last-iterate convergence guarantees under gradient domination assumptions. Furthermore, under specific noise models where the (hyper)policy is expressed as a stochastic perturbation of the actions or of the parameters of an underlying deterministic policy, we additionally establish global last-iterate convergence guarantees of C-PG to the optimal deterministic policy. This holds when learning a stochastic (hyper)policy and subsequently switching off the stochasticity at the end of training, thereby deploying a deterministic policy. Finally, we empirically validate both the action-based (C-PGAE) and parameter-based (C-PGPE) variants of C-PG on constrained control tasks, and compare them against state-of-the-art baselines, demonstrating their effectiveness, in particular when deploying deterministic policies after training.
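The parameter-based exploration underlying the C-PGPE variant can be illustrated on a toy problem. The sketch below is a hand-rolled PGPE-style likelihood-ratio update on a 1-D surrogate objective, not the authors' algorithm, and the constraint is omitted for brevity: a Gaussian hyperpolicy N(mu, sigma^2) over the parameter of a deterministic policy is trained, and at deployment the stochasticity is switched off by using theta = mu.

```python
import numpy as np

rng = np.random.default_rng(0)

# Surrogate for the return of a deterministic policy with parameter theta
# (an illustrative assumption; maximized at theta = 2).
def J(theta):
    return -(theta - 2.0) ** 2

mu, sigma, lr = 0.0, 1.0, 0.05
for _ in range(2000):
    thetas = mu + sigma * rng.standard_normal(16)  # sample from the hyperpolicy
    returns = J(thetas)
    b = returns.mean()                             # baseline for variance reduction
    # Likelihood-ratio gradient of E[J] w.r.t. the hyperpolicy mean:
    grad_mu = ((returns - b) * (thetas - mu) / sigma**2).mean()
    mu += lr * grad_mu

# Deploy the deterministic policy: switch off the stochasticity, use theta = mu.
print(mu)
```

The deployed deterministic parameter ends up near the optimum of `J`, which mirrors the paper's point that one can train a stochastic hyperpolicy and deploy its deterministic mean.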


Achieving \tilde{O}(1/\epsilon) Sample Complexity for Constrained Markov Decision Process

Neural Information Processing Systems

We consider the reinforcement learning problem for the constrained Markov decision process (CMDP), which plays a central role in satisfying safety or resource constraints in sequential learning and decision-making. In this problem, we are given finite resources and an MDP with unknown transition probabilities. At each stage, we take an action, collecting a reward and consuming some resources, all of which are assumed unknown and must be learned over time. In this work, we take the first step towards deriving optimal problem-dependent guarantees for CMDP problems. We derive a logarithmic regret bound, which translates into a $O\left(\frac{1}{\Delta \cdot \epsilon} \cdot \log^2(1/\epsilon)\right)$ sample complexity bound, with $\Delta$ being a problem-dependent parameter, yet independent of $\epsilon$.


Primal-Dual Sample Complexity Bounds for Constrained Markov Decision Processes with Multiple Constraints

Buckley, Max, Papathanasiou, Konstantinos, Spanopoulos, Andreas

arXiv.org Artificial Intelligence

This paper addresses the challenge of solving Constrained Markov Decision Processes (CMDPs) with $d > 1$ constraints when the transition dynamics are unknown, but samples can be drawn from a generative model. We propose a model-based algorithm for infinite horizon CMDPs with multiple constraints in the tabular setting, aiming to derive and prove sample complexity bounds for learning near-optimal policies. Our approach tackles both the relaxed and strict feasibility settings, where relaxed feasibility allows some constraint violations, and strict feasibility requires adherence to all constraints. The main contributions include the development of the algorithm and the derivation of sample complexity bounds for both settings. For the relaxed feasibility setting we show that our algorithm requires $\tilde{\mathcal{O}} \left( \frac{d |\mathcal{S}| |\mathcal{A}| \log(1/\delta)}{(1-\gamma)^3\epsilon^2} \right)$ samples to return an $\epsilon$-optimal policy, while in the strict feasibility setting it requires $\tilde{\mathcal{O}} \left( \frac{d^3 |\mathcal{S}| |\mathcal{A}| \log(1/\delta)}{(1-\gamma)^5\epsilon^2{\zeta_{\mathbf{c}}^*}^2} \right)$ samples.


Review for NeurIPS paper: Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes

Neural Information Processing Systems

Strengths: Comments about the paper: This paper presents a convergence analysis of primal-dual natural policy gradient methods under the CMDP framework. Several recent works have shown convergence of policy gradients and optimality bounds (e.g., Agarwal et al., Mei et al.), but this paper extends similar analysis to (a) natural policy gradients and (b) the CMDP framework with constraints. Overall, it achieves a sublinear rate of convergence in the CMDP framework, similar to other related works with convergence analysis. The analysis is done for the general MDP case with function approximation and restricted policy classes. It is a very well written paper that is easy to follow, with significant theoretical derivation and proof details.


Review for NeurIPS paper: Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes

Neural Information Processing Systems

After reading the authors' rebuttal, the reviewers discussed their concerns about this paper. Ultimately, a consensus was not reached, as reviewer #1 feels that the issues raised in her/his review were not properly addressed in the authors' feedback. The other reviewers also share some of the concerns raised by reviewer #1, but, given the rebuttals, they believe the authors can fix them in the final version and make the contribution of their paper clearer. I agree with them and so I suggest accepting the paper, but I recommend that the authors take into consideration the issues raised in the reviews and address them carefully in the final version of the paper.


Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes

Neural Information Processing Systems

We study sequential decision-making problems in which each agent aims to maximize the expected total reward while satisfying a constraint on the expected total utility. We employ the natural policy gradient method to solve the discounted infinite-horizon Constrained Markov Decision Processes (CMDPs) problem. Specifically, we propose a new Natural Policy Gradient Primal-Dual (NPG-PD) method for CMDPs which updates the primal variable via natural policy gradient ascent and the dual variable via projected sub-gradient descent. Even though the underlying maximization involves a nonconcave objective function and a nonconvex constraint set under the softmax policy parametrization, we prove that our method achieves global convergence with sublinear rates regarding both the optimality gap and the constraint violation. Such a convergence is independent of the size of the state-action space, i.e., it is dimension-free.
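The primal-dual structure described above (natural policy gradient ascent on the policy, projected sub-gradient steps on the multiplier) can be sketched in a tabular toy CMDP. The example below is a minimal illustration, not the paper's algorithm: the dynamics, reward, utility, budget, and step sizes are all made-up values, and it uses the known closed form of the natural gradient step for softmax policies (a shift of the logits by the advantage).

```python
import numpy as np

# Toy CMDP (all numbers are illustrative assumptions).
nS, nA, gamma, budget = 2, 2, 0.9, 3.0
P = np.full((nS, nA, nS), 0.5)            # action-independent uniform dynamics
r = np.array([[0., 1.], [0., 1.]])        # reward to maximize
c = np.array([[0., 2.], [0., 1.]])        # utility/cost to keep below budget
mu = np.array([0.5, 0.5])                 # initial state distribution

def softmax_policy(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def evaluate(pi, g):
    """Exact V and Q for per-step signal g(s, a) under policy pi."""
    P_pi = np.einsum('sap,sa->sp', P, pi)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, (pi * g).sum(axis=1))
    return V, g + gamma * P @ V

theta, lam = np.zeros((nS, nA)), 0.0
cost_hist = []
for t in range(2000):
    pi = softmax_policy(theta)
    _, Qr = evaluate(pi, r)
    Vc, Qc = evaluate(pi, c)
    Ql = Qr - lam * Qc                             # Lagrangian Q-values
    A = Ql - (pi * Ql).sum(axis=1, keepdims=True)  # Lagrangian advantage
    theta += 2.0 * A                               # NPG ascent (softmax closed form)
    # Projected sub-gradient step on the dual variable (diminishing step size):
    lam = max(0.0, lam + 0.5 / np.sqrt(t + 1) * (mu @ Vc - budget))
    cost_hist.append(mu @ Vc)

avg_cost = np.mean(cost_hist[1000:])
print(avg_cost, lam)   # average constraint value hovers near the budget
```

With constant or slowly decaying steps the last iterate can oscillate around the constraint boundary; the averaged constraint value approaches the budget, consistent with the sublinear rates the paper proves.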


Safe Reinforcement Learning for Constrained Markov Decision Processes with Stochastic Stopping Time

Mazumdar, Abhijit, Wisniewski, Rafal, Bujorianu, Manuela L.

arXiv.org Artificial Intelligence

In this paper, we present an online reinforcement learning algorithm for constrained Markov decision processes with a safety constraint. Despite considerable attention from the scientific community, the problem of learning an optimal policy under a stochastic stopping time without violating safety constraints during the learning phase has yet to be addressed. To this end, we propose an algorithm based on linear programming that does not require a process model. We show that the learned policy is safe with high confidence. We also propose a method to compute a safe baseline policy, which is central to developing algorithms that do not violate the safety constraints. Finally, we provide simulation results to show the efficacy of the proposed algorithm. Further, we demonstrate that efficient exploration can be achieved by defining a subset of the state space called the proxy set.


DDM-Lag : A Diffusion-based Decision-making Model for Autonomous Vehicles with Lagrangian Safety Enhancement

Liu, Jiaqi, Hang, Peng, Zhao, Xiaocong, Wang, Jianqiang, Sun, Jian

arXiv.org Artificial Intelligence

Decision-making stands as a pivotal component in the realm of autonomous vehicles (AVs), playing a crucial role in navigating the intricacies of autonomous driving. Amidst the evolving landscape of data-driven methodologies, enhancing decision-making performance in complex scenarios has emerged as a prominent research focus. Despite considerable advancements, current learning-based decision-making approaches exhibit potential for refinement, particularly in aspects of policy articulation and safety assurance. To address these challenges, we introduce DDM-Lag, a Diffusion Decision Model augmented with Lagrangian-based safety enhancements. In our approach, the autonomous driving decision-making conundrum is conceptualized as a Constrained Markov Decision Process (CMDP). We have crafted an Actor-Critic framework, wherein the diffusion model is employed as the actor, facilitating policy exploration and learning. The integration of safety constraints in the CMDP and the adoption of a Lagrangian relaxation-based policy optimization technique ensure enhanced decision safety. A PID controller is employed for the stable updating of model parameters. The effectiveness of DDM-Lag is evaluated through different driving tasks, showcasing improvements in decision-making safety and overall performance compared to baselines.
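A common way to use a PID controller in Lagrangian-based safe RL is to set the Lagrange multiplier from the constraint violation, reacting faster than integral-only (plain gradient) ascent. The sketch below is a generic PID multiplier update in that spirit; the gains, budget, and anti-windup details are illustrative assumptions, and the paper's controller may differ in detail.

```python
# PID-controlled Lagrange multiplier (sketch; gains are illustrative).
class PIDLagrangian:
    def __init__(self, kp=0.1, ki=0.01, kd=0.05, budget=1.0):
        self.kp, self.ki, self.kd, self.budget = kp, ki, kd, budget
        self.integral = 0.0
        self.prev_cost = 0.0

    def update(self, episode_cost):
        """Return the multiplier to use in the next policy update."""
        err = episode_cost - self.budget                 # constraint violation
        self.integral = max(0.0, self.integral + err)    # anti-windup: clip at zero
        deriv = max(0.0, episode_cost - self.prev_cost)  # only penalize rising cost
        self.prev_cost = episode_cost
        return max(0.0, self.kp * err + self.ki * self.integral + self.kd * deriv)

pid = PIDLagrangian(budget=1.0)
lam_high = [pid.update(2.0) for _ in range(5)][-1]   # sustained violation: lam grows
lam_low = [pid.update(0.0) for _ in range(50)][-1]   # constraint satisfied: lam decays
print(lam_high, lam_low)
```

The multiplier rises while the measured cost exceeds the budget and decays back to zero once the constraint is satisfied, which is the stabilizing behavior the abstract attributes to the PID component.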


A policy gradient approach for Finite Horizon Constrained Markov Decision Processes

Guin, Soumyajit, Bhatnagar, Shalabh

arXiv.org Artificial Intelligence

The infinite horizon setting is widely adopted for problems of reinforcement learning (RL), and in this setting the optimal policies are invariably stationary. In many situations, however, finite horizon control problems are of interest, and for such problems the optimal policies are time-varying in general. Another setting that has become popular in recent times is Constrained Reinforcement Learning, where the agent maximizes its rewards while also aiming to satisfy some given constraint criteria. However, this setting has only been studied in the context of infinite horizon MDPs, where stationary policies are optimal. We present an algorithm for constrained RL in the finite horizon setting, where the horizon terminates after a fixed (finite) time. We use function approximation in our algorithm, which is essential when the state and action spaces are large or continuous, and use the policy gradient method to find the optimal policy. The optimal policy that we obtain depends on the stage and so is non-stationary in general. To the best of our knowledge, our paper presents the first policy gradient algorithm for the finite horizon setting with constraints. We show the convergence of our algorithm to a constrained optimal policy. We also compare and analyze the performance of our algorithm through experiments, showing that it performs better than some other well-known algorithms.
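The nonstationarity this paper exploits is easy to see in a tiny example: backward induction on a finite-horizon MDP produces stage-dependent greedy actions. The unconstrained sketch below uses a made-up 2-state MDP (the paper's algorithm additionally handles constraints via policy gradients and function approximation); the optimal action in the start state differs between early stages and the final stage.

```python
import numpy as np

# Finite-horizon backward induction on a 2-state, 2-action MDP (illustrative).
# Action 0: take reward 1 now and stay; action 1: forgo reward but move to the
# absorbing state 1, which pays 5 per remaining stage.
H, nS, nA = 4, 2, 2
P = np.zeros((nS, nA, nS))
P[0, 0, 0] = 1.0   # stay in the start state
P[0, 1, 1] = 1.0   # move to the good state
P[1, :, 1] = 1.0   # state 1 is absorbing
r = np.array([[1.0, 0.0], [5.0, 5.0]])

V = np.zeros(nS)                     # terminal value V_H = 0
pi = np.zeros((H, nS), dtype=int)    # stage-dependent (non-stationary) policy
for h in reversed(range(H)):
    Q = r + P @ V                    # Q_h(s, a) = r(s, a) + E[V_{h+1}(s')]
    pi[h] = Q.argmax(axis=1)
    V = Q.max(axis=1)

print(pi[:, 0])   # action in state 0 per stage: invest early, cash in at the end
```

At the final stage the immediate reward dominates, while at earlier stages it pays to move to the absorbing state, so the optimal policy is genuinely time-varying, unlike in the infinite horizon setting.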